The literature review identified three main methodological categories for synthetic data generation in sports science: GAN-based, simulation-based, and statistical models.
These categories describe the primary routes through which synthetic sports data are produced, each striking a different balance between realism, reproducibility, and computational complexity.
The analysis of public sports datasets revealed that most datasets center on non-professional athlete populations, while a smaller number focus on elite athletes in specific sports. Sports representation is diverse, but most datasets cover multiple sports rather than a single discipline. Specific emphasis appears in basketball, cricket, football, and fitness (general exercises), which frequently integrate tabular and video data types. Video and tabular (game or player statistics) formats constitute the majority of available datasets, followed by accelerometer, survey, and physiological records. Physiological and video data also correspond to the largest sample sizes and exhibit higher methodological validity among the datasets analysed.
The geographic distribution of public sports datasets shows that the United States, China, and Australia are the principal contributors. The data sources fall into three main publication types: academic articles, online repositories such as Kaggle, and curated programming packages. Articles represent the largest proportion, reflecting peer-reviewed studies and structured experimental datasets. Kaggle datasets come from open competitions or from datasets published by community members, whereas package datasets originate from software libraries that centralize standardized data tables for reproducible research.
An examination of dataset connectivity between sports and variable types revealed that most datasets involving multiple sports rely on video and/or tabular variables. Accelerometer and physiological data are mainly used in composite datasets that combine specific sports or track longitudinal performance. Biomechanical variables occur only in a few specialized datasets dedicated to controlled simulations such as cycling or impact training.
The ranking comparison shows that GAN-based datasets rely primarily on video and image data focused on athlete performance and activity detection, whereas Statistical datasets encompass tabular, physiological, and survey-based data. The evaluation criteria highlight the datasets most suitable for each approach, supporting the selection of appropriate data sources for developing synthetic datasets with Statistical or GAN-based methods.
| Category | Number of Datasets | Data Types | Population | Most Frequent Sports | Top 3 (by Score) |
|---|---|---|---|---|---|
| GAN-based | 16 | Video, Image | Athlete | Multiple, Basketball, Fitness | TeamTrack, C-Sports, SportsMOT |
| Statistical | 34 | Tabular, Physiological, Medical Record, Survey, Accelerometer | Athlete, Multiple | Football, Baseball, Basketball, Fitness | MTS-5, NCAA-ISP, LLBD |
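
The correspondence between data types and generation approach summarized in the table can be sketched as a simple lookup. This is a minimal illustration only: the data types are copied from the table, while the function name `candidate_categories` and the overlap rule are assumptions made for the example and do not reproduce the scoring criteria used in the ranking.

```python
# Data types listed for each category in the summary table above.
CATEGORY_DATA_TYPES = {
    "GAN-based": {"Video", "Image"},
    "Statistical": {
        "Tabular",
        "Physiological",
        "Medical Record",
        "Survey",
        "Accelerometer",
    },
}


def candidate_categories(dataset_types):
    """Return the categories whose listed data types overlap the dataset's types.

    Hypothetical helper: a plain set-overlap check, not the review's scoring.
    """
    types = set(dataset_types)
    return [
        category
        for category, allowed in CATEGORY_DATA_TYPES.items()
        if types & allowed
    ]


if __name__ == "__main__":
    # A dataset that mixes video with player statistics matches both categories.
    print(candidate_categories(["Video", "Tabular"]))
    # A survey-only dataset matches the Statistical category.
    print(candidate_categories(["Survey"]))
```

A dataset whose types appear under neither category (for example, biomechanical variables only) returns an empty list under this rule, which matches the table's coverage rather than any judgement about the dataset's usefulness.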